SNP discovery in proso millet (Panicum miliaceum L.) using low-pass genome sequencing

Abstract Domesticated ~10,000 years ago in northern China, proso millet (Panicum miliaceum L.) is a climate-resilient and human health-promoting cereal crop. The genome size of this self-pollinated allotetraploid is 923 Mb. Proso millet seeds are an important part of the human diet in many countries. In the USA, its use is restricted to the birdseed and pet food market. Demand for proso millet is gradually growing in the global human health and wellness food market owing to health-promoting properties such as its low glycemic index and lack of gluten. Breeding efforts to develop improved proso millet cultivars are hindered by the dearth of genomic resources available to researchers. The publication of the reference genome and the availability of cost-effective NGS methodologies could lead to the identification of high-quality genetic variants, which can be incorporated into breeding pipelines. Here, we report the identification of single-nucleotide polymorphisms (SNPs) by low-pass (1×) genome sequencing of 85 diverse proso millet accessions from 23 different countries. The 2 × 150 bp Illumina paired-end reads generated after sequencing were aligned to the proso millet reference genome. The resulting sequence alignment information was used to call SNPs. We obtained 972,863 bi-allelic SNPs after quality filtering of the raw SNPs. These SNPs were used to assess the population structure and phylogenetic relationships among the accessions. Most of the accessions were found to be highly inbred, with heterozygosity ranging between 0.05 and 0.20. Principal component analysis (PCA) showed that the first two principal components (PC1 and PC2) together explained 19% of the variability in the population. PCA also clustered all the genotypes into three groups. A neighbor-joining tree clustered the genotypes into four distinct groups exhibiting diverse representation within the population. The SNPs identified in our study could be used for molecular breeding and genetics research (e.g., genetic and association mapping, and population genetics) in proso millet after proper validation.


| INTRODUCTION
Proso millet (Panicum miliaceum L.) is one of the oldest cereal crops known to mankind. It was domesticated approximately 10,000 years ago in the semiarid regions of northern China (Hunt et al., 2014; Lu et al., 2009). Following domestication, the crop spread westward across the Eurasian steppes and was widely cultivated in eastern Europe by 3000 BC (Valamoti, 2016). This minor millet is cultivated on ~820,000 and 700,000-1,000,000 ha of farmland in Russia and China, respectively (Vetriventhan et al., 2020; Wang et al., 2016).
German-Russian immigrants brought proso millet seeds with them when they migrated to the United States (Habiyaremye et al., 2017; Santra, 2013; Wietgrefe, 1990) and started early cultivation along the Atlantic coast of North America, which later spread westward into the interior of the continent (Wietgrefe, 1990). Today, proso millet is widely cultivated across the High Plains of the United States, with approximately 200,000 ha of annual production.
Proso millet is prized for both its short growing season (60-90 days) and its high water-use efficiency (Briggs & Shantz, 1913; Nielsen & Vigil, 2017). The shallow root system of proso millet, combined with its high water-use efficiency (it produces the most grain from the least water of any crop), means that the crop leaves behind more soil moisture for subsequent crops (Nielsen & Vigil, 2018). Proso millet's rapid life cycle and water-sparing nature have made it a desirable rotation partner with winter wheat, where it can replace a summer fallow, enabling farmers to harvest an additional crop of grain from the same field (Agdag et al., 2001; Anderson et al., 1999; Lyon et al., 2008; Hinze & Smika, 1983; Halvorson et al., 1994).
Proso millet is an essential part of the human diet in many countries in Asia, Europe, and Africa, including India, China, South Korea, Japan, and eastern Europe (Carpenter & Hopen, 1985; Das et al., 2019). However, in the USA, the largest single consumer of proso millet is the birdseed and pet food market, with significant additional demand coming from the export and gluten-free health food markets (Graybosch & Baltensperger, 2009; Habiyaremye et al., 2017; Kalinova & Moudry, 2006). Proso millet protein contains a specific prolamin fraction below the permissible level, making it suitable for a low-GI (glycemic index), gluten-free diet for patients suffering from celiac disease (Kalinova & Moudry, 2006; McSweeney et al., 2017). Proso millet seeds are a good source of carbohydrates, protein, fat, and crude fiber (Motta Romero et al., 2017; Saleh et al., 2013; Vetriventhan & Upadhyaya, 2018). Some phenolic compounds in the seed are known to protect against cancer and heart disease (Kalinová, 2007). All these nutritional and health-promoting properties make proso millet an excellent food for people suffering from serious diseases such as type-2 diabetes mellitus, cardiovascular disease, and celiac disease. In recognition of both its climate-friendly and human health-promoting properties, the Food and Agriculture Organization (FAO) listed proso millet as one of the future smart crops of the 21st century (Li & Siddique, 2018).
The proso millet genome is relatively small (923 megabases) (Zou et al., 2019). However, the species is an allotetraploid with a chromosome number of 2n = 4x = 36 (Hamoud et al., 1994). The two subgenomes of proso millet are estimated to have diverged from each other ~5.6 million years ago (Zou et al., 2019). Substantial genetic redundancy exists between the subgenomes of proso millet, with partial redundancy between the two duplicated copies of the GBSSI gene (Hunt et al., 2013).
Genomic resources in proso millet are very limited compared to those of the major crops, as it is largely an under-researched and underutilized crop (Upadhyaya et al., 2014). Understanding the population structure of crop germplasm based on genetic diversity analysis is important for genetic improvement and marker-trait association studies. In proso millet, the genetic diversity of the available germplasm has been studied using phenotypic (morphological, agronomical, and seed nutrient) and genotypic data (Hunt et al., 2011; Johnson et al., 2019; Li et al., 2021; Upadhyaya et al., 2011; Vetriventhan et al., 2019). Prior to the recent genome sequence (Shi et al., 2019; Zou et al., 2019), the genomic resources of proso millet were limited to non-sequence-based DNA markers (e.g., RAPD, AFLP, and SSR) (Khound & Santra, 2020; Santra et al., 2019) and sequence-based single nucleotide polymorphisms (SNPs) (Johnson et al., 2019; Wang et al., 2018; Yue et al., 2016). Since the whole genome was published, several populations of proso millet lines have been genotyped for larger numbers of SNPs using either restriction-site associated DNA sequencing (RAD-seq) to genotype 2,412 segregating SNP markers or specific-locus amplified fragment sequencing (SLAF-seq) to genotype up to 126,822 SNP markers (Boukail et al., 2021; Li et al., 2021). Asia and a few from other regions such as Europe, West Asia, and India), to study genetic diversity and population structure. The genetic diversity indices, that is, observed heterozygosity (Ho), expected heterozygosity (He), and nucleotide diversity (π), of the cultivated proso millet were significantly lower than those of the weedy types. The wild and feral types of weedy proso millet could also be distinguished using the SNPs (Li et al., 2021).
Both RAD-seq and SLAF-seq target only a part of the genome for sequencing. In principle, this reduces the total quantity of sequence data that must be generated from each individual. However, both approaches increase the complexity of the molecular biology steps necessary to generate libraries for sequencing and necessarily restrict the total number of markers that can be discovered. Here, we employ low-coverage whole-genome resequencing to characterize a set of 85 proso millet accessions originating from around the globe. Specifically, we seek to (1) identify segregating markers within the representative sample of global proso millet germplasm (~700 genotypes), including all the US cultivars, which were not sampled in previous studies, and (2) quantify patterns of population structure and phylogenetic relationships among the proso millet lines using the identified SNPs.

| DNA extraction and Illumina sequencing
Leaf samples for DNA extraction were harvested at phenological stage 9 on the BBCH scale, 4 weeks after planting (Ventura et al., 2020). Leaves were cut into 4 cm (~80 mg) pieces and collected in 2 ml microcentrifuge tubes. DNA was extracted from the leaf tissue using the MagMAX Plant DNA Isolation kit following the manufacturer's protocol (Applied Biosystems, Massachusetts, USA). The eluted DNA samples were purified using a 96 Deep-Well KingFisher Flex Magnetic Particle Processor (Thermo Fisher Scientific, Waltham, Massachusetts, USA). The purity of the extracted DNA was assessed using a DS-11 FX+ spectrophotometer/fluorometer (DeNovix). Library construction for sequencing was conducted using the iGenomX RIPTIDE high throughput rapid library prep kit (iGenomX, Carlsbad, California, USA) following the manufacturer's instructions (https://igenomx.com/product/riptide/). The samples were subsequently sequenced to an average coverage depth of 1× using 2 × 150 bp paired-end reads on the Illumina HiSeq X platform by Psomagen Inc., Rockville, Maryland, USA.

| Data preprocessing and sequence alignment
The resulting paired-end sequence data were quality checked using

| SNP calling and filtering
The sorted BAM files generated after sequence alignment were used to call SNPs using GATK toolkit v 5.1 (Schmidt, 2009

| Data analysis
Allele frequency data were tabulated from the VCF-formatted data using the --freq option in plink v1.90 (Purcell et al., 2007).  (Bradbury et al., 2007), and the eigenvalues of 10 PCs were selected to create the scree plot to determine the proportion of variance explained by each PC. Further, PC1 was plotted against PC2 for the PCA plot using the Python package seaborn. A neighbor-joining phylogenetic tree was constructed from the pairwise distances between taxa/genotypes for SNPs with MAF > 0.05 using the TASSEL plugin "-tree Neighbor" (TASSEL v5). The phylogenetic tree was displayed using iTOL, an online tool for displaying and annotating phylogenetic trees (Letunic & Bork, 2021). Each genotype was color coded following the same coloring scheme as was used in the PCA plot. To determine the relatedness among 13 of the 85 accessions with pedigree information, a SNP-based kinship matrix was estimated using the "-Kinship" plugin in TASSEL v5 and was visualized as a heatmap using the "gplots" package in R. The R package "ggplot2" was used to generate the supplementary bar diagram depicting the sequence alignment rates of all the accessions.
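As an illustration of the quantities described above (minor allele frequency, principal component scores, and the proportion of variance explained by each PC), the following is a minimal NumPy sketch. It is not the plink/TASSEL workflow used in the study. The function names, the 0/1/2 genotype coding, the mean imputation of missing calls, and the reuse of the MAF > 0.05 threshold for PCA are all assumptions made for illustration.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """Per-SNP minor allele frequency.

    genotypes: (n_samples, n_snps) array of alternate-allele counts
    coded 0/1/2, with np.nan for missing calls (assumed coding).
    """
    alt_freq = np.nanmean(genotypes, axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def pca_variance_explained(genotypes, maf_threshold=0.05, n_components=10):
    """Return PC scores and the proportion of variance explained per PC."""
    maf = minor_allele_frequency(genotypes)
    kept = genotypes[:, maf > maf_threshold]  # drop rare/monomorphic SNPs

    # Mean-impute missing calls (a simplification), then center each SNP.
    col_means = np.nanmean(kept, axis=0)
    filled = np.where(np.isnan(kept), col_means, kept)
    centered = filled - col_means

    # SVD of the centered matrix; squared singular values are proportional
    # to the eigenvalues of the covariance matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s ** 2 / (centered.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()

    k = min(n_components, len(s))
    scores = u[:, :k] * s[:k]  # sample coordinates on the first k PCs
    return scores, explained[:k]
```

The first two columns of `scores` correspond to the PC1 versus PC2 scatter plot, and `explained` holds the values that would appear in a scree plot.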

| RESULTS
The total number of 150 bp paired-end reads from sequencing 85 proso millet accessions was ~632 million, amounting to more than 86 billion bp. The number of 150 bp paired-end reads and the total sequence for each of the 85 genotypes are given in Table S1.
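As a rough consistency check on the stated 1× sequencing depth, the average per-accession coverage can be estimated from the totals above and the 923 Mb genome size. The short sketch below uses only those reported figures; treating "more than 86 billion bp" as exactly 86 billion bp is an assumption, so the result is a lower-bound estimate based on raw (not aligned) bases.

```python
GENOME_SIZE_BP = 923e6   # proso millet genome size reported in the text
TOTAL_BASES = 86e9       # "more than 86 billion bp", taken as a lower bound
N_ACCESSIONS = 85


def mean_coverage(total_bases, n_samples, genome_size_bp):
    """Average sequencing depth per sample, assuming equal data per sample."""
    return total_bases / n_samples / genome_size_bp


avg = mean_coverage(TOTAL_BASES, N_ACCESSIONS, GENOME_SIZE_BP)
print(f"Estimated mean raw coverage: {avg:.2f}x")
```

Under these assumptions the estimate comes out at roughly 1.1×, in line with the low-pass (1×) design. Per-accession values will differ from this mean (see Table S1).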
• Five of the seven East Asian genotypes (PS18, PS25, PS27, PS84, and PS85) were grouped together in Group III, and the other two (PS24 and PS26) were placed in Group II.
In the heatmap and dendrogram (Figure 5), the relative positions of the 13 selected genotypes fully matched their pedigree relationships, as supported by the following observations.
• PS27 (PI436626) was the male parent of PS105, cultivar "Plateau" (Santra et al., 2014), and they were next to each other in the heatmap (Cluster C in Figure 5).
• PS116 and PS117, the duplicated samples of the same genotype "Cerise", were plotted next to each other in the heatmap (Cluster D in Figure 5).
These observations indicate that the SNPs reported in our study are non-random. We recognize that a better way to verify them would be to re-sequence a subset of genotypes and compare the SNPs. Unfortunately, we do not have the resources (personnel and funds) to support this additional experiment because the project has ended and no budget remains. Therefore, we decided not to conduct the additional sequencing.
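The pedigree check above rests on a SNP-based kinship matrix. As a hedged illustration of how such a matrix can be computed, the sketch below builds a VanRaden-style genomic relationship matrix from allele counts. This is a common formulation, but the source does not state which kinship method was selected in TASSEL, so the choice of formula, the function name, and the 0/1/2 coding without missing data are assumptions.

```python
import numpy as np


def genomic_relationship_matrix(genotypes):
    """VanRaden-style relationship matrix (illustrative, not TASSEL's code).

    genotypes: (n_samples, n_snps) alternate-allele counts coded 0/1/2,
    with no missing values (assumed for simplicity).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    p = genotypes.mean(axis=0) / 2.0           # alternate-allele frequency
    polymorphic = (p > 0) & (p < 1)            # monomorphic SNPs carry no signal
    z = genotypes[:, polymorphic] - 2.0 * p[polymorphic]  # center by 2p
    scale = 2.0 * np.sum(p[polymorphic] * (1.0 - p[polymorphic]))
    return z @ z.T / scale
```

With a matrix of this kind, duplicated samples of the same genotype would be expected to show an off-diagonal value close to their diagonal values, and a parent and its progeny should be more similar to each other than to unrelated accessions. These are the patterns that a heatmap and dendrogram make visible.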

| DISCUSSION
The sequencing and assembly of the proso millet genome have opened a new era of proso millet breeding and genetics by enabling the use of high-throughput sequencing-based strategies for discovering genetic markers, which can enhance molecular breeding, quantitative genetic, and population genetic analyses. Here, we identified and scored 972,863 SNPs from 85 accessions of proso millet originating from 23 countries spanning different regions of the globe.
In the current report, the first 10 PCs were found to explain ~33% of the total variation. This finding is consistent with the observations reported by Miao et al. (2019) in several other crop species. They observed that the first 10 PCs in foxtail millet, sorghum, maize, and rice could explain approximately 30%, 40%, 40%, and 15% of the variation, respectively. Therefore, this current result in proso millet is
Figure 5. Heatmap showing the genetic relatedness among 13 selected proso millet genotypes estimated using the SNP data. The relatedness among the genotypes was calculated as a kinship matrix, which is presented graphically in the figure. Red and its gradients indicate close association between genotypes, while blue indicates distantly related genotypes.
A similar clustering of the North American genotypes based on SSR marker analysis was also reported in our previous study (Santra et al., 2019). In other words, the majority (20 of 23) of the genotypes in this group originated from either Europe (4) or North America (16). In addition, the countries of origin of the four European lines are Hungary (PS17), Romania (PS72), Germany (PS76), and Russia (PS86). This is not unexpected considering the history and origin of proso millet cultivation in the United States.
Proso millet was brought to the United States by German-Russian immigrants, who started cultivating this crop along the eastern Atlantic coast (Habiyaremye et al., 2017; Santra, 2013; Wietgrefe, 1990).
Accessions in both Groups III and IV are not represented by a single major country or region. Group III had genotypes from East Asia SNPs (Hunt et al., 2011;Khound & Santra, 2020;Santra et al., 2019).
This is very much expected considering the history of Asian origin and Eurasian distribution of proso millet, which were elegantly illustrated earlier (Hunt et al., 2011;Lu et al., 2009).
The 972,863 SNPs we identified from a diverse population could serve as a valuable resource for conducting GWAS in proso millet.
These SNPs can also be used for marker-assisted selection (MAS) after proper validation. The relationships among the accessions depicted by the phylogenetic tree could be used for selecting parents for QTL mapping and cultivar development. The findings from this study may encourage researchers working on other minor crops to adopt this approach to identify genetic variants cost-effectively. Further studies can be conducted on the efficiency of this approach in identifying high-quality SNPs and other genetic variants in comparison to high-coverage sequencing.

ACKNOWLEDGMENTS
We would like to thank the Holland Computing Center (HCC) at the University of Nebraska-Lincoln for providing the high-performance computing resources for running most of our analyses. We are grateful to David Brenner of the USDA North Central Regional Plant Introduction Station, Ames, Iowa, for providing us with the germplasm for this project. We also acknowledge the suggestions we received from